Statistical Stemming for Kannada

Author

  • Suma Bhat
Abstract

Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval to improve recall. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compare stemming by simple truncation, by clustering, and by an unsupervised morpheme segmentation algorithm on a sample from a text collection. We observe that, among the clustering-based stemmers, the best performer uses a distance measure that rewards longest prefix matches. However, a reasonably performing unsupervised morpheme segmentation seems to outperform the other stemming schemes considered.
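To make two of the compared ideas concrete, here is a minimal Python sketch of truncation stemming and of clustering with a distance that rewards long shared prefixes. It is an illustration only, not the method evaluated in the paper: the function names, the distance formula, the greedy clustering procedure, the threshold, and the romanized example words are all assumptions made for this sketch.

```python
def truncate_stem(word: str, n: int = 5) -> str:
    """Truncation stemmer: keep only the first n characters of the word."""
    return word[:n]


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    count = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        count += 1
    return count


def prefix_distance(a: str, b: str) -> float:
    """Hypothetical distance in [0, 1]; a longer shared prefix gives a smaller value."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - shared_prefix_len(a, b) / longest


def cluster_by_prefix(words, threshold=0.6):
    """Greedy one-pass clustering (an assumption of this sketch).

    Each word joins the first cluster whose first member is within
    `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for word in words:
        for cluster in clusters:
            if prefix_distance(word, cluster[0]) <= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


# Romanized word forms, used purely as illustrative input.
words = ["mane", "manege", "maneyalli", "shale", "shalege"]
clusters = cluster_by_prefix(words, threshold=0.6)
```

Each resulting cluster stands for one stem class. Truncation needs no training data but cuts every word at the same length regardless of its morphology, whereas the clustering variant lets related forms of different lengths share a class. For text in Kannada script, slicing by code point can split a consonant from its vowel sign, so a real implementation would likely need to operate on larger orthographic units.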


Similar resources

NLP Challenges for Machine Translation from English to Indian Languages

This natural language processing work is carried out particularly on English-Kannada/Telugu. Kannada is a language of India. The Kannada language has a classification of Dravidian, Southern, Tamil-Kannada, and Kannada. Regions Spoken: Kannada is also spoken in Karnataka, Andhra Pradesh, Tamil Nadu, and Maharashtra. Population: The total population of people who speak Kannada is 35,346,000, as of 1997. A...


Development and standardization of Morningness-Eveningness Questionnaire (MEQ) in the Indian language Kannada.

INTRODUCTION A circadian rhythm is any biological process that displays an endogenous, entrainable, oscillation of about 24 hours; the rhythms driven by a circadian clock and sleep have been widely observed in plants, animals, fungi and cyanobacteria. The main aim of the current study was to translate and validate the Morningness-Eveningness Questionnaire (MEQ) to Kannada (MEQ-K). MATERIALS A...


A Maximum Entropy Approach to Kannada Part Of Speech Tagging

Part Of Speech (POS) tagging is the most important pre-processing step in almost all Natural Language Processing (NLP) applications. It is defined as the process of classifying each word in a text with its appropriate part of speech. In this paper, the probabilistic classifier technique of Maximum Entropy model is experimented for the tagging of Kannada sentences. Kannada language is agglutinat...


Adaptation of the Oswestry Disability Index to Kannada Language and Evaluation of Its Validity and Reliability.

STUDY DESIGN A translation, cross-cultural adaptation, and validation study. OBJECTIVE The aim of this study was to translate, adapt cross-culturally, and validate the Kannada version of the Oswestry Disability Index (ODI). SUMMARY OF BACKGROUND DATA Low back pain is recognized as an important public health problem. Self-administered condition-specific questionnaires are important tools for...


Named Entity Recognition and Classification in Kannada Language

Named Entity Recognition and Classification (NERC) is an essential and challenging task in Natural Language Processing (NLP). Kannada is a highly inflectional and agglutinating language providing one of the richest and most challenging sets of linguistic and statistical features, resulting in long and complex word forms that are large in number. It is primarily a suffixing language, and an inflected word starts with a root an...




Publication date: 2013